Question 4:
Perform survival classification (survived vs. did not survive) on the Titanic dataset (available from public dataset portals). Carry out some Exploratory Data Analysis (EDA), explain the hyperparameters used, and explain the bias-variance tradeoff and how you handle it in this case.
Answer
In [ ]:
import pandas as pd
pd.set_option('future.no_silent_downcasting',True)
dt = pd.read_csv('./data.csv')
In [ ]:
# Describe data shape
print("Data Shape")
print(dt.shape)
print("--------------")
# Describe overall data
print("Data Info")
dt.info(memory_usage=False)  # info() prints directly and returns None
print("--------------")
print("Data Description")
print(dt.describe())
print("--------------")
Data Shape
(891, 12)
--------------
Data Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
--------------
Data Description
PassengerId Survived Pclass Age SibSp \
count 891.000000 891.000000 891.000000 714.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008
std 257.353842 0.486592 0.836071 14.526497 1.102743
min 1.000000 0.000000 1.000000 0.420000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000
75% 668.500000 1.000000 3.000000 38.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000
Parch Fare
count 891.000000 891.000000
mean 0.381594 32.204208
std 0.806057 49.693429
min 0.000000 0.000000
25% 0.000000 7.910400
50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200
--------------
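Beyond `describe()`, a quick grouped summary makes the survival signal in the categorical features visible. A minimal sketch of the idea, run here on a tiny made-up sample with the same column names (the real notebook would group `dt` directly):

```python
import pandas as pd

# Tiny illustrative sample with the same columns as the Titanic data
# (values are made up for demonstration, not taken from data.csv).
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Sex":      ["male", "female", "female", "female",
                 "male", "male", "male", "female"],
    "Pclass":   [3, 1, 3, 1, 3, 2, 1, 3],
})

# Survival rate by sex: the mean of the 0/1 target within each group
by_sex = sample.groupby("Sex")["Survived"].mean()

# Survival rate by passenger class
by_class = sample.groupby("Pclass")["Survived"].mean()

print(by_sex)
print(by_class)
```

On the real data the same two lines reproduce the well-known pattern that sex and passenger class are strong predictors of survival.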
In [ ]:
dt.head()
Out[ ]:
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
In [ ]:
from ydata_profiling import ProfileReport
ProfileReport(dt, title="Profiling Report")
Out[ ]:
Data preprocessing
Columns used:
- Survived : target (Y)
- Pclass
- Sex
- Age
- SibSp
- Parch
- Embarked
- Fare
Missing values:
- Age => fill with the mean
- Embarked => fill with the most frequent value (mode)
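The fill values can be derived from the data rather than hard-coded. A sketch of the idea on a small made-up frame with the same column names:

```python
import pandas as pd

# Illustrative frame with missing values in Age and Embarked
# (made-up values; the real data comes from data.csv).
df = pd.DataFrame({
    "Age":      [22.0, 38.0, None, 35.0, None],
    "Embarked": ["S", "C", "S", None, "Q"],
})

age_mean = df["Age"].mean()               # mean over the non-null ages
embarked_mode = df["Embarked"].mode()[0]  # most frequent port

df = df.fillna({"Age": age_mean, "Embarked": embarked_mode})
print(age_mean, embarked_mode)
```

Using `mean()` and `mode()` keeps the imputation consistent if the underlying data changes.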
In [ ]:
dt
Out[ ]:
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
In [ ]:
df = dt.copy()
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
# Feature selection
df = df[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
# Encode categorical features as numbers
df['Sex'] = df['Sex'].replace(['male', 'female'], [1, 0])
df['Embarked'] = df['Embarked'].replace(['S', 'C', 'Q'], [0, 1, 2])
# Fill missing values BEFORE scaling, otherwise a raw fill value such as
# 29 would land outside the scaled [0, 1] range:
# Age      => mean (~29.7)
# Embarked => mode ('S', encoded as 0)
df = df.fillna({
    "Age": df['Age'].mean(),
    "Embarked": 0,
})
# Normalisation using a min-max scaler
mms = MinMaxScaler()
for col in df.drop(columns=['Survived']).columns:
    df[col] = mms.fit_transform(df[[col]])
# Split into training and test sets
x = df.drop(columns=['Survived'])
y = df['Survived'].astype('int')
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2)
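One caveat in the cell above: the scaler is fitted on the full dataset before the split, so the test rows influence the scaling of the training features (a mild form of data leakage). A common alternative, sketched here on synthetic stand-in data rather than the notebook's `x`/`y`, fits the scaler inside an sklearn `Pipeline` so it only ever sees the training split:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in features/target (the real x, y come from the
# preprocessing cell above).
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))
y = (x[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=2)

# fit() learns the min/max from x_train only; score()/predict() apply
# those same learned statistics to x_test.
model = Pipeline([
    ("scale", MinMaxScaler()),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
])
model.fit(x_train, y_train)
acc = model.score(x_test, y_test)
print(f"Test accuracy: {acc:.3f}")
```

For min-max scaling the leakage effect is small, but the pipeline pattern generalises safely to any preprocessing step.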
Modelling
In [ ]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, f1_score, precision_score
%matplotlib inline
# Define the hyperparameter grid
param_grid = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# Define the classifier
clf_t = DecisionTreeClassifier()
# Define the cross-validation strategy
cv = KFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(clf_t, param_grid, cv=cv, scoring='accuracy')
grid_search.fit(x_train, y_train)
print("Best hyperparameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
# Refit a classifier with the best hyperparameters found above
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=1, min_samples_split=2)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print(confusion_matrix(y_test, y_pred))
print(f"Accuracy : {accuracy_score(y_test, y_pred)}")
print(f"Precision : {precision_score(y_test, y_pred)}")
print(f"Recall : {recall_score(y_test, y_pred)}")
print(f"F1-Score : {f1_score(y_test, y_pred)}")
plot_tree(clf)
Best hyperparameters: {'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best score: 0.8230473751600511
[[91 9]
[27 52]]
Accuracy : 0.7988826815642458
Precision : 0.8524590163934426
Recall : 0.6582278481012658
F1-Score : 0.7428571428571429
Out[ ]:
(Decision tree plot, depth 3, 712 training samples at the root: the root splits on the Sex feature, with further splits on Pclass, Age, Fare and SibSp.)
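The question also asks about the bias-variance tradeoff. For a decision tree, `max_depth` is the main lever: a shallow tree underfits (high bias, poor accuracy on both splits), while an unconstrained tree memorises the training data (high variance, a large gap between training and cross-validation accuracy). The grid search above picked `max_depth=3` precisely because it balanced the two. Sweeping depth makes the tradeoff visible; a sketch on noisy synthetic data (the real notebook would pass `x_train, y_train` instead):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Noisy synthetic binary classification problem as a stand-in
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = ((X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=300)) > 0).astype(int)

results = {}
for depth in [1, 3, 7, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Train accuracy rises with depth; a widening gap between train and
    # cross-validation accuracy signals variance (overfitting).
    train_acc = clf.fit(X, y).score(X, y)
    cv_acc = cross_val_score(clf, X, y, cv=5).mean()
    results[depth] = (train_acc, cv_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, cv={cv_acc:.3f}")
```

In this sweep the unconstrained tree reaches perfect training accuracy while its cross-validated accuracy lags well behind, which is the variance side of the tradeoff; the depth-1 stump sits at the bias end.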